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Abstract 



In this note we illustrate and develop further with mathematics and examples, the 
work on successive standardization (or normalization) that is studied earlier by the same 
authors in [5] and [3]. Thus, we deal with successive iterations applied to rectangular 
arrays of numbers, where to avoid technical difficulties an array has at least three 
rows and at least three columns. Without loss, an iteration begins with operations on 
columns: first subtract the mean of each column; then divide by its standard deviation. 
The iteration continues with the same two operations done successively for rows. These 
four operations applied in sequence completes one iteration. One then iterates again, 
and again, and again,.... In [5] it was argued that if arrays are made up of real numbers, 
then the set for which convergence of these successive iterations fails has Lebesgue 
measure 0. The limiting array has row and column means 0, row and column standard 
deviations 1. A basic result on convergence given in [2] is true, though the argument in 
[2] is faulty. The result is stated in the form of a theorem here, and the argument for 
the theorem is correct. Moreover, many graphics given in [2] suggest that but for a set 
of entries of any array with Lebesgue measure 0, convergence is very rapid, eventually 
exponentially fast in the number of iterations. Because we learned this set of rules 
from Bradley Efron, we call it "Efron's algorithm" . More importantly, the rapidity of 
convergence is illustrated by numerical examples. 

Key words: rectangular arrays, successive iterations, standardization, exponentially fast 
convergence. 
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1 Introduction 

Research summarized in this paper is an extension of that reported in [2] and a conference 
proceeding [3], and concerns successive standardization (or normalization) of large rectan- 
gular arrays X of real numbers, such as arise in gene expression, from protein chips, or 
in the earth and environmental sciences. The basic message here is that convergence that 
holds on all but a set of measure in the paper by Olshen and Rajaratnam ([2] is shown 
here to be exponentially fast in a sense we make precise. The basic result in [2], though 
true is not argued correctly in [2]. The gap is filled here. Typically there is one column per 
subject; rows correspond to "genes" or perhaps gene fragments (including those that owe to 
different splicing of "the same" genes, or proteins). Typically, though not always, columns 
divide naturally into two groups: "affected" or not. Two-sample testing of rows that corre- 
spond to "affected" versus other individuals or samples is then carried out simultaneously 
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for each row of the array. Corrections for multiple comparisons may be very simple, or 
might perhaps allow for "false discovery." 

As was noted in [2], data may still suffer from problems that have nothing to do with 
differences between groups of subjects or differences between "genes" or groups of them. 
There may be variation in background, perhaps also in "primers." Thus, variability across 
subjects might be unrelated to status. In comparing two random vectors that may have been 
measured in different scales, one puts observations "on the same footing" by subtracting 
each vector's mean and dividing by its standard deviation. Thereby, empirical covariances 
are changed to empirical correlations, and comparisons proceed. But how does one do this 
in the rectangular arrays described earlier? An algorithm by which such "regularization" 
is accomplished was described to us by colleague Bradley Efron; so we call the full algo- 
rithm Efron's algorithm. We shall use the terms successive "normalization" or successive 
"standardization" interchangeably (and also point out that Bradley Efron considers the the 
latter term a better description of the algorithm). To avoid technical problems that are 
described in [2] and repeated here, we assume there are at least three rows and at least 
three columns to the array. It is immaterial to convergence, though not to limiting values, 
whether we begin regularization to be described by row or by column. In order that we fix 
an algorithm, we begin by column in the computations though by row in the mathematics. 
Thus, we first mean polish the column, then standard deviation polish the column; next we 
mean polish the row, and standard deviation polish the row. The process is then repeated. 
By "mean polish" of, say, a column, we mean subtract the mean value for that column from 
every entry. By "standard deviation" polish the column, we mean divide each number by 
the standard deviation of the numbers in that column. Definitions for "row" are entirely 
analogous. 

In [2] convergence is studied with entries in the rectangular array with I rows and J 
columns viewed as elements of W IJ , Euclidean IJ space. Convergence holds for all but a 
Borel subset of W IJ of Lebesgue measure 0. The limiting vector has all row and column 
means 0, all row and column standard deviations 1. We emphasize that to show convergence 
on a Lebesgue set of full measure, it is enough to find a probability mutually absolutely 
continuous with respect to Lebesgue measure for which convergence is established with 
probability 1. 



2 Preliminaries 

We begin precise definitions: 
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and analogously define (£• ) 2 for n = 2,3,.. By induction on dimension applied to 

regular conditional distributions, (Sf ) 2 > a.s. 

As explained in [2] the 2x2 case illustrates the need to work with dimensions greater 
than or equal to 3. Consider the following arbitrary 2x2 matrix 



If a < b and c < d, then 



so (Sf) 2 = 0. 

If o, b, c, d are, say, iid and have symmetric distributions, then 

P((sVf = o) = l. 

A moment's reflection shows that if X is I x 2, with n odd, then after each row normal- 
ization, each column has an odd number of entries, each entry being +1 or —1. However, 



X 
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c d 



-1 1 
-1 1 



each row has exactly one +1 and one -1. Thus (S^ - S, 
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is impossible. With 



min(J, J) > 3, both tend to 1 a.s. asn/ oo. The fact that {Ajj} iid implies in particular 
that X is row and column exchangeable. Thus, if 



X = [x ljC ,...x JjC ], 

where each column is / x 1, and it is a permutation of the integers {1,..., J}, then X 

[ X 7r(l),0 



, Xn(J),c]' T ne analogous holds for rows, where 



X 



x i, 



where 1 X J 



3 



3 Background and Motivation 



A first step in the argument of [2] is to note that Lebesgue measure on the Borel subsets of 
M. IJ is mutually absolutely continuous with respect to IJ product Gaussian measure, each 
coordinate being standard Gaussian. Thereby, the distinction between measure and topol- 
ogy is blurred; arguments of [2] as corrected here. Having translated a problem concerning 
Lebesgue measure to one concerning Gaussian measure, one cannot help note from graphs 
in |2j Figures 1, 2, 3, 6, and 7 that suggest with entries of X chosen independently from a 
common absolutely continuous distribution, and as led to these figures, almost surely ulti- 
mately, convergence is at least exponentially fast. In these graphs, the ordinate is always 
the difference of logarithms (the base does not matter) of squared Frobenius norm of the 
difference between current iteration and the immediately previous iteration; always abscissa 
indexes iteration. One purpose of this paper is to demonstrate the ultimately almost sure 
rapidity of convergence. Readers will note we assume that coordinates are independent and 
identically distributed, with standard normal distributions. This assumption is used explic- 
itly though unnecessarily in [2], but only implicitly here, and then only to the extent that 
arguments here depend on those in [2]. Obviously, this Gaussian assumption is sufficient. 
It is pretty obviously not necessary. 

For the sake of motivation we illustrate the patterns of convergence from successive 
normalizations on 5 x 5 matrices. We first describe the details of the successive normalization 
in a manner very similar to [2 J . Consider a simple 5x5 matrix with entries generated 
from a normal distribution with a given mean and standard deviation. In our case we take 
the mean to be 2 and variance to be 4, though the specific values the mean and variance 
parameter take do not really matter. We first standardize the initial matrix X^ ) at the 
level of each by row, i.e., first subtracting the row mean from each entry and then dividing 
each entry in a given row by its row standard deviation. The matrix is then standardized 
at the level of each column, i.e., by first subtracting the column mean from each entry and 
then by dividing each entry by the respective column standard deviation. Row mean and 
standard deviation polishing followed by column mean and standard deviation polishing 
is defined as one iteration in the process of attempting to row and column standardize 
the matrix. The resulting matrix is defined as X*- 1 ). Now the same process is repeated 
with XW and repeated until successive renormalization eventually yields a row and column 
standardized matrix. The successive normalizations are repeated until "convergence" which 
for our purposes is defined as the difference in the squared Frobenius norm between two 
consecutive iterations being less than 10~ 8 . 

The figures (see Figures 1-2) are plots of the log of the ratios of the squared Frobenius 
norms of the differences between consecutive iterates. In particular, they capture the type 
of convergence patterns that are observed in the 5x5 case from different starting values. 
In [2\ it was proved that regardless of the starting value (provided the dimensions are at 
least 3) the process of successive normalization always converges, in the sense that it leads 
to a doubly standardized matrix. In addition, it was noted that the convergence is very 
rapid. We can see empirically from Figures 1-11 that eventually the log of the successive 
squared differences tends to decay in a straight line, i.e., the rate of convergence is perhaps 
exponential. This phenomenon of linear decay between successive iterations in the log scale 
is observed in all the diagrams. Hence, one is led naturally to ask whether this is always 
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Figure 1: Examples of patterns of convergence 
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Figure 2: Examples of patterns of convergence 



true in theory, and if so under what conditions. The rate of convergence of successive 
normalization and related questions are addressed in this paper. 

A natural question to ask is whether the convergence phenomenon observed will still 
occur if simultaneous normalization is undertaken as compared to successive normalization. 
In other words the row and column mean polishing and row and column standard deviation 
polishing is all done at once: 



vt 

v-*+i = —i 

In this case the simultaneous normalization algorithm does not converge. In fact it is 
shown to diverge. Figures 3-8 below illustrate this phenomenon. 
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Figure 3: Two examples of simultaneous normalization: these illustrate that simultaneous 
normalization does not lead to convergence 




Figure 4: (Left) Example of simultaneous normalization, (Right) Same example zoomed in. 




Figure 5: (Left) Example of simultaneous normalization, (Right) Same example zoomed in. 

4 Convergence and Rates 

Theorem 4.1 of [2] is false. A claimed backwards martingale is NOT. Fortunately, all that 
seems damaged by the mistake is pride. Much is true. To establish what this is requires 
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Figure 6: (Left) Example of simultaneous normalization, (Right) Same example zoomed in. 





Figure 7: Two further examples of simultaneous normalization: these illustrate that simul- 
taneous normalization does not lead to convergence 
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Figure 8: Example of simultaneous normalization 



notation. 



X denotes an / x J matrix with values x G We take coordinates Xij(x) = Xjj to be 



iid N(0, 1). As in 0, [3], 3 < min(J, J) < max(I, J) < oo. As before, 

(s; o) ) 2 = jE(^--4 o) ) 2 (i) 

i=i 

= jE(^) 2 -jE^f ) + (^ 0) ) 2 (2) 

1 J 

J 3=1 

= [X\f], where, in an obvious notation x\f = (X lj - X^/S^. 

In view of [2], almost surely (S^) 2 > 0. By analogy, set X^ 2 ) = pQy], where X^ = 
(x\f - X ( f)/Sf\ the definition of (sf ] ) 2 being clear. As in [2J, (sf ] ) 2 > 0. In general, 
for m odd, x[f = {X^- l) - Xt' 1] )/St~ 1] . 

Likewise, for m even x[™ ] = (x\™~ X) - xjp~ 1) )/S^ m ~ 1) . Without loss, we take (sf } ) 2 and 

(Sj 1 ^) 2 to be positive for all (i,j,l). 

In various places we make implicit use of a theorem of Skorokhod, the Heine-Borel 
Theorem, and this obvious fact. Suppose we are given a sequence of random variables Y = 
(Yi, Y2, ...) and a condition Co that depends only on their finite-dimensional distributions. 
If we wish to make a conclusion C\ concerning Y, then it is enough to find one probability 
space that supports Y with given finite dimensional distributions for which Cq implies 
C\. The theorem of Skorokhod mentioned(see pages 6-8 of [1]) is to the effect that if the 
underlying probability space is (for measure theoretic purposes) the real line with measures 
given by, say, distribution functions F = (Fx, F%, ...); and if F is compact with respect to 
weak convergence, where F qi converges to G in distribution; then if each F{ is absolutely 
continuous with respect to some H (which itself is absolutely continuous), and Yi(x) = Fi(x), 
then Y qi converges H- almost surely to a random variable Y. 

Off of a set of probability 0, J2ij( X if) 2 = H f °r all q > 1. We assume that x lies outside 
this "cursed" set. This is key to convergence. 

Almost surely, (S^ 2 " 1 ^) 2 has positive limsup as m increases without bound. 

LetA j = {m m (S^ 1) ) 2 = 0}. 

P(Aj) = j = 1,2, ... J 

Since the entries of X are independent, X is row and column exchangeable. This prop- 
erty is inherited by XW for every q. Because all entries (for q > 1) are a.e. bounded 
uniformly in q, E{X^} and E{(X^) 2 } exist and are finite (with fixed bound that applies 
to all q). Exchangeability implies that all E{x\f] = 0, and all E{(x\f) 2 } = 1. Bounded 
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convergence implies that if {S^ m ^) 2 tends to along a subsequence as m increases, not 
only is the limit bounded as a function of x, but also the limit random variable has expec- 
tation 0. Necessarily every almost sure subsequential limit in m of the random variables 
X ( f " 1} has mean 0. Likewise, every almost sure subsequential limit in m of the random 

variables {X^ m ~^) 2 has expectation 1. All are bounded as functions of x. One consequence 
of these matters is that P(Aj) = 0. The next paragraph is a proof. 

The only subsequential almost sure limits of {{Sf m ~ l) ) 2 : j = 1,2,...} and of {{sf m) f : 
j = 1,2,...} have expectation 1. Fix a j, 1 < j < J. Let E = E(j) = {i : for some 
qi, Pi\\va\x\ q - l \ > 0) > 0}. Column exchangeability implies that for any pair E E(j) 

iff i' G E(j). The first sentence entails that E(j) is not empty. Therefore, E(j) = {1, /}. 

Let iq 7^ i\. There is a subsequence of {qi} - for simplicity write it as {qi} - along which 
almost surely 



(a) limX; *' ' and HmX J 1 ' both exist; 



(b) limX (2 f 1} and lim*f? j) both exist; and 

ii 113 ii 113 

If P(lim qi \xf o f - xf Y f\ = 0) = 1 then exchangeability implies that (sf" 1 ^) 2 -»■ 0. So 



without loss of generality, if E = {\\m qi \xf o f - xfj ] \ > 0} then P(E) > 0. Write 



\ x i a j ~ x i j )-y x hj ~ x hj >-y x ioj - x id - l )l b j 

Since {xf^) 2 - {xf g J ->■ a.s.((a) and the expectations of both are 1), and 

likewise with i a replaced by i\, ((b), etc.), and because x 2 — y 2 = (x — y)(x + y), the first 
expression of the immediately previous display tends to on E. So, too, does the second 

(2 —1) 

expression. This is possible only if Sj 91 — > 1 on E (we take the positive square root). 
On E c , Sf qi ~ 1] -> 0. Since E[sf qi ~ 1] ] -> 1, P(E C ) = 0. Further, 

(2 qi ) _ (2 qi -l) X i o3 {bj ~ l )+ X -j 
^ioj ^ioj ) (2„-l) 



As a corollary, one sees now that X,- qi — > a.s. Since the original {qi} could be taken to 
be an arbitrary subsequence of {q}, we conclude that convergence of row and column means 
to and convergence of row and column standard deviations to 1 takes place everywhere 
except on a set of Lebesgue measure 0. 

THEOREM 4.1 Efron's algorithm converges almost surely for X on a Borel set of entries 
with complement a set of Lebesgue measure 0. 
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We turn now to a study of rates of convergence. To begin, define the following 

Z = limX( m ) a.e. and xf™~ 1] = X^™^ - Z iv 

Now let Xj = maxi | A J • | . Almost everywhere convergence and the fact that for each 
row and each column, not every Zy can be of the same sign enable us to conclude that for 
m large enough 

,-y(2m-l), (I ~ 1) (2m-l) 

Remember that for j = 1, J > 3, j J2i=i(Zij) 2 = 1, and analogously for i = 1, / > 

3. 

Now write 



jE(^r _l) - x -i) 2 -TE(%-°) 2 

i=l i=l 



2 l2 

a — o 



= (a + 6) (a — b), so 
o-l = [(a 2 -b 2 )/a + b]-l 

We know that For all Z\. = Z.j = a.e. To continue, we compute that 

.(2m) v (2m-l)i 



jr(2m-l) . o fVio „_ Ul - Tm „™+ „f fcO™-!^ 

Now write 



where m is the positive square root of {S- 



= [(xff - 1 ) - z^) 2 + {Za - z,f + (z, - x {2m ^f 
+ 2(x^ m - 1) -z^)(z,-z,) 

+ 2(X^ m - 1) -Z J )(Z,-Z ( f- 1) ) 
+ 2(Z ij -?. j )(Z. j -xf n - 1) ) 

One argues that 

i £ (J r<v-D -xf^V < mj + j f - Zy)(Z, - Z,), 

i=l (i=l) 

where D = D(I, J) < oo. A key observation is that for every m, there exists j = j(m) 
for which {X^ m ^} are not of the same or all of the opposite sign as {Zy : i = 1, ...,/}, 
which, as was noted, are not, themselves of the same sign. Argue analogously regarding 
{x[? m ^}. Refer now to Figure El Our arguments show the correctness of the concentric 
circles in ~R IJ on which X^") , x( n+1 ) , x( n+2 ) , • • • ultimately lie. We conclude that to suitable 
approximation, the successive iterates lie on such circles with radii in geometric ratio. That 
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Figure 9: For n large enough log{||X( n+2 ) - X( n+1 ) ||^/||X {n+1) - X( n )||^} < 1 uniformly 

ratio is uniformly < 1, but is not arbitrarily close to 0. However Figure [9] buries a key idea 
in our study of convergence are rates. Thus, write 8^ for the angle between and Z. 
Obviously, the squared Frobenius norm ||X( n ) — Z|||, can be expressed as 

||XW - Z||| = ||XW||! + ||Z||| - (2cos(0 (n) )||x( n )|| F ||Z|| F . 

Now rewrite cos(#( n )) = 1 — 2jj(||X( n ) — Z||^). Therefore, convergence of X^ n ) to Z 
implies that 9^ can be taken arbitrarily close to 0. This is another way of saying that for 
each n, ||XW|||, = IJ. Failure of this condition is why "simultaneous normalization" fails. 

5 Illustrations of Convergence 

We now illustrate the rapidity of convergence in the 3x3 case to give a geometric perspective. 
First note that in the 3x3 case the set of fixed points are characterized by 3 unique values. 
For instance in the following doubly normalized matrix which arises a result of successive 
normalization has only three unique elements. 



-1.4137 0.7407 0.6730 
0.7407 0.6730 -1.4137 
0.6730 -1.4137 0.7407 



(4) 



Hence we can use the first column of the limit matrix to represent the fixed point arising 
from successive normalization. Hence the curve of fixed points can be generated by applying 
the successive normalization process to random starting values. Figures 10-11 below 
illustrates the curve that characterizes the set of fixed points for the 3x3 matrix case 
when this numerical exercise is implemented. The origin is marked in black. The latter 3 
subfigures superimpose the unit circle on the diagram in order to illustrate that the set of 
fixed points represent a "ring" around the unit sphere. Figure 12 considers different starting 



11 



Figure 10: (Left) Set of fixed points, (Right) Set of fixed points + unit sphere 




Figure 11: Set of fixed points + unit sphere from different perspectives 

values and their respective paths of convergence to the "ring" of fixed points. Figures 13 - 
15, presents 3 individual illustrations of different starting values and their respective paths 
of convergence to the "ring" of fixed points. For each example, a magnification or close up 
of the path is provided. It is clear from these diagrams that the algorithm "accelerates" or 
"speeds up" as the sequence nears its limit. 
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Figure 12: Illustration of convergence to fixed points from multiple starts 




Figure 13: (Left) Example of convergence to set of fixed points, (Right) Close up of Example 
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Figure 14: (Left) Example of convergence to set of fixed points, (Right) Close up of Example 

6 Topics for future study 

There are at least several directions for future research that continues to build from [2], [3] 
and from material presented here. For one, Adam Olshen notes that often in applications, 
and for many reasons, X may not be exactly rectangular. Thus, some initial coordinates may 
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Figure 15: (Left) Example of convergence to set of fixed points, (Right) Close up of Example 

be missing. By arguments not reported here we have shown that so long as "missingness" 
is independent of values that would be reported for a full X, and no matter the realization 
there are almost surely genuine data for at least three rows for each column and at least 
three columns for each row, and so long as standard deviations are defined with "correct:" 
divisors, then convergence on all but a set of initial values of measure is plausible. The 
process of rendering rows and columns with means and standard deviations 1 is potentially 
a powerful "hammer" in search of a biological "nail." Just as studying data by (scale-free) 
correlations rather than in original given scales can lend insight, so, too, can the process of 
successive normalization described here. 
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